ggplot2 and choosing the right visualization for your audience

“Advanced Graphics and Data Visualization in R” is brought to you by the Centre for the Analysis of Genome Evolution & Function’s (CAGEF) bioinformatics training initiative. This course, CSB1021, was developed to enhance the skills of students with basic backgrounds in R by focusing on available philosophies, methods, and packages for plotting scientific data. While the datasets and examples used in this course will be centred on SARS-CoV-2 epidemiological and genomic data, the lessons learned herein will be broadly applicable.
This lesson is the second in a 6-part series. The aim for the end of this series is for students to recognize how to import, format, and display data based on their intended message and audience. The format and style of these visualizations will help to identify and convey the key message(s) from their experimental data.
The structure of the class is a code-along style in R markdown notebooks. At the start of each lecture, skeleton versions of the lecture will be provided for use on the University of Toronto datatools Hub so students can program along with the instructor.
This week will delve into some helpful visualizations available through the ggplot2 package. First we’ll go over deciding which visualizations are best for what we want to accomplish, and then explore some of these in more detail.
At the end of this lecture you will have covered the following topics
grey background - a package, function, code, command, or directory. Backticks are also used for in-line code.
italics - an important term or concept, or an individual file or folder
bold - a heading or a term that is being defined
blue text - a named or unnamed hyperlink
... - within each coding cell, this indicates an area of code that students will need to complete for the code cell to run correctly.
Blue box: a key concept that is being introduced
Yellow box: a risk or caution
Green box: recommended reads and resources to learn R
Red box: a comprehension question which may or may not involve a coding cell. You will usually find these at the end of a section.
Each week, new lesson files will appear within your RStudio folders.
We are pulling from a GitHub repository using this Repository
git-pull link. Simply click on the link and it will take you to the
University of Toronto datatools
Hub. You will need to use your UTORid credentials to complete the
login process. From there you will find each week’s lecture files in the
directory /2024-03-Adv_Graphics_R/Lecture_XX. You will find
a partially coded skeleton.Rmd file as well as all of the
data files necessary to run the week’s lecture.
Alternatively, you can download the R Markdown notebook (.Rmd) and data files from the RStudio server to your personal computer if you would like to run them independently of the University of Toronto tools.
A live lecture version will be available at camok.github.io that will update as the lecture progresses. Be sure to refresh to take a look if you get lost!
As mentioned above, at the end of each lecture there will be a completed version of the lecture code released as a PDF file under the Modules section of Quercus.
Today’s datasets will focus on some more Ontario public health unit data. We’ll deep dive a bit more into the demographics of SARS-CoV-2 infection, hospitalization, and survival.
Retrieved from the open data sets available at Public Health Ontario, this data provides summary statistic information for all PHUs in Ontario including the population size for each PHU.
This data originated from the Ontario Provincial government but was tidied during lecture 01 to reflect a long-format set of data compatible with our needs.
Retrieved from the open data sets available at Public Health Ontario, this dataset focuses on summarizing PHU data across age and sex groups.
tidyverse - a collection of packages including dplyr, tidyr, stringr, forcats, and ggplot2
viridis - helps to create colour-blind-friendly palettes for our data visualizations
lubridate and zoo - helper packages used for working with date formats in R
ggbeeswarm, ggridges, and GGally - new packages we need to install. They’ll help us generate some nice graphs.
# ---------- Install packages here ---------- #
# install.packages("ggbeeswarm", dependencies = TRUE)
# install.packages("ggridges")
# install.packages("GGally")
# ---------- Source packages here ---------- #
# Packages to help tidy our data
library(tidyverse)
# Packages for the graphical analysis section
library(viridis)
# New visualisation packages
library(ggridges) # ridgeline plots
library(GGally) # Parallel coordinate plots
library(ggbeeswarm)
# packages used for working with/formatting dates in R
library(lubridate)
library(zoo)
How do we extract meaningful insights from our data? If you have previously explored Anscombe’s quartet you’ll know that, as scientists and lay people, we can sometimes be obsessed with summary statistics - mean, median, mode, standard deviation. While these values are a helpful way to quickly assess a population, they can be flawed to the point of deception. Instead, we should temper our investigations by visualizing our data. A deeper understanding of your data trends and potential models comes from dissecting attributes of your data, which can jump out more easily through visualization.
Equally important is the ability to convey our findings to others. The right visualizations, whether simplistic or complex, should effectively communicate our key message.
One approach to effective data visualization relies on the Grammar of Graphics framework originally proposed by Leland Wilkinson (2005). The idea of grammar can be summarized as follows:
The grammar of graphics facilitates the concise description of the components of any graphic. Hadley Wickham of tidyverse fame has proposed a variant on this concept: the layered grammar of graphics framework. By following a layered approach of defined components, it becomes easy to build a visualization.
The Major Components of the Grammar of Graphics by Dipanjan Sarkar
We can break down the above pyramid by the base components, building from the bottom upwards.
1. Data: your visualization always starts here. What are the dimensions you want to visualize? What aspect of your data are you trying to convey?
2. Aesthetics: assign your axes based on the data dimensions you have chosen. Where will the majority of the data fall on your plot? Are there other dimensions (such as categorically encoded groupings) that can be conveyed by aspects like size, shape, colour, fill, etc.?
3. Scale: do you need to scale/transform any values to fit your data within a range? This includes layers that map between the data and the aesthetics.
4. Geometric objects: how will you display your data within your visualization? Which geom_* will you use?
5. Statistics: are there additional summary statistics that should be included in the visualization? Some examples include central tendency, spread, confidence intervals, standard error, etc.
6. Facets: will generating subplots of the data add a dimension to our visualization that would otherwise be lost?
7. Coordinate system: will your visualization follow a classic cartesian, semi-log, polar, etc. coordinate system?
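These seven components correspond one-to-one with ggplot2 layers. As a minimal sketch, here is each component rendered as a layer using the built-in mtcars dataset (chosen purely for illustration; it is not part of the lecture data):

```r
library(ggplot2)

# Each grammar component maps onto a ggplot2 layer
p <- ggplot(mtcars) +                           # 1. Data
  aes(x = wt, y = mpg, colour = factor(cyl)) +  # 2. Aesthetics
  scale_y_log10() +                             # 3. Scale
  geom_point() +                                # 4. Geometric objects
  geom_smooth(method = "lm", se = TRUE) +       # 5. Statistics
  facet_wrap(~ am) +                            # 6. Facets
  coord_cartesian()                             # 7. Coordinate system
```

Not every plot needs all seven; data, aesthetics, and at least one geom are usually enough to get started.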
Let’s look at a summarized data set this week with cumulative case counts across all PHUs in Ontario. What is nice about this data set is that it also carries some population data with it to give us a sense of proportions. This data comes from a period where cumulative case counts still had some meaning to them (March 2022) so we will use it to help demonstrate some of the information we want to visualize.
Steps we’ll take in working with this data:
Read in data/COVID-19_map_data_220303.csv with read_csv().

Let’s open that up with our friend read_csv().
# Read in PHU_population_information.csv
phu_information.df <- read_csv("data/COVID-19_map_data_220303.csv")
## Rows: 35 Columns: 19
## -- Column specification ------------------------------------------------------
## Delimiter: ","
## chr  (1): Geographic area
## dbl (18): Recent Case Count, Recent Case Rate, Cumulative Case Count, Cumula...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Check the structure and preview the data
str(phu_information.df)
## spc_tbl_ [35 x 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Geographic area : chr [1:35] "Ontario" "Algoma Public Health" "Brant County Health Unit" "Chatham-Kent Public Health" ...
## $ Recent Case Count : num [1:35] 25751 733 233 322 876 ...
## $ Recent Case Rate : num [1:35] 175 622 152 302 151 ...
## $ Cumulative Case Count : num [1:35] 1107408 5275 9908 6924 48526 ...
## $ Cumulative Case Rate : num [1:35] 7516 4476 6452 6494 8342 ...
## $ Cumulative Hospitalizations Count : num [1:35] 42132 206 369 278 2252 ...
## $ Cumulative Hospitalizations Rate : num [1:35] 286 175 240 261 387 ...
## $ Cumulative Deaths Count : num [1:35] 12497 33 76 72 523 ...
## $ Cumulative Deaths Rate : num [1:35] 84.8 28 49.5 67.5 89.9 62.3 98.2 28.4 70.8 49.8 ...
## $ At least one dose count : num [1:35] 12489709 97881 122613 85128 474994 ...
## $ Completed primary series count : num [1:35] 11947990 93941 117653 81743 454665 ...
## $ At least one dose coverage (%) : num [1:35] 84.8 83.1 79.8 79.8 81.7 84.3 81.7 76.6 76.5 82.6 ...
## $ Completed primary series coverage (%) : num [1:35] 81.1 79.7 76.6 76.7 78.2 81.1 78.1 73.7 73.8 79.6 ...
## $ At least one dose count among eligible : num [1:35] 12486011 97873 122608 85120 474910 ...
## $ Completed primary series count among eligible : num [1:35] 11946134 93935 117650 81737 454628 ...
## $ At least one dose coverage (%) among eligible : num [1:35] 89.1 86.7 84.6 83.8 86 89.3 85.9 80.8 80.4 86 ...
## $ Completed primary series coverage (%) among eligible: num [1:35] 85.3 83.2 81.2 80.5 82.4 85.9 82.2 77.7 77.6 82.9 ...
## $ Population : num [1:35] 14734014 117840 153558 106620 581722 ...
## $ Eligible population : num [1:35] 14010998 112839 144945 101565 552042 ...
## - attr(*, "spec")=
## .. cols(
## ..   `Geographic area` = col_character(),
## ..   `Recent Case Count` = col_double(),
## ..   `Recent Case Rate` = col_double(),
## ..   `Cumulative Case Count` = col_double(),
## ..   `Cumulative Case Rate` = col_double(),
## ..   `Cumulative Hospitalizations Count` = col_double(),
## ..   `Cumulative Hospitalizations Rate` = col_double(),
## ..   `Cumulative Deaths Count` = col_double(),
## ..   `Cumulative Deaths Rate` = col_double(),
## ..   `At least one dose count` = col_double(),
## ..   `Completed primary series count` = col_double(),
## ..   `At least one dose coverage (%)` = col_double(),
## ..   `Completed primary series coverage (%)` = col_double(),
## ..   `At least one dose count among eligible` = col_double(),
## ..   `Completed primary series count among eligible` = col_double(),
## ..   `At least one dose coverage (%) among eligible` = col_double(),
## ..   `Completed primary series coverage (%) among eligible` = col_double(),
## ..   Population = col_double(),
## ..   `Eligible population` = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
head(phu_information.df)
Looking at the data itself, we want to perform the following wrangling actions:
1. Rename the variable names to lower case.
2. In the variable names, replace all of the spaces with _ characters.
3. Update the values in the Geographic area variable to remove excess information.
4. Drop the variables Recent Case Count and Recent Case Rate.
5. Drop all of the vaccination variables (containing the words “Completed” or “dose”). We’ll use the selection helper matches() to accomplish this.
6. Remove the prefix word “Cumulative” from any variable names.
7. Rename the Geographic area variable to public_health_unit.
unique(phu_information.df$`Geographic area`)
## [1] "Ontario"
## [2] "Algoma Public Health"
## [3] "Brant County Health Unit"
## [4] "Chatham-Kent Public Health"
## [5] "City of Hamilton Public Health Services"
## [6] "Durham Region Health Department"
## [7] "Eastern Ontario Health Unit"
## [8] "Grey Bruce Health Unit"
## [9] "Haldimand-Norfolk Health Unit"
## [10] "Haliburton, Kawartha, Pine Ridge District Health Unit"
## [11] "Halton Region Public Health"
## [12] "Hastings Prince Edward Public Health"
## [13] "Huron Perth Health Unit"
## [14] "Kingston, Frontenac and Lennox & Addington Public Health"
## [15] "Lambton Public Health"
## [16] "Leeds, Grenville & Lanark District Health Unit"
## [17] "Middlesex-London Health Unit"
## [18] "Niagara Region Public Health"
## [19] "North Bay Parry Sound District Health Unit"
## [20] "Northwestern Health Unit"
## [21] "Ottawa Public Health"
## [22] "Peel Public Health"
## [23] "Peterborough Public Health"
## [24] "Porcupine Health Unit"
## [25] "Public Health Sudbury & Districts"
## [26] "Region of Waterloo Public Health and Emergency Services"
## [27] "Renfrew County and District Health Unit"
## [28] "Simcoe Muskoka District Health Unit"
## [29] "Southwestern Public Health"
## [30] "Thunder Bay District Health Unit"
## [31] "Timiskaming Health Unit"
## [32] "Toronto Public Health"
## [33] "Wellington-Dufferin-Guelph Public Health"
## [34] "Windsor-Essex County Health Unit"
## [35] "York Region Public Health"
# Take our PHU information and tidy it up a bit
# (Note: the %<>% assignment pipe requires magrittr, which library(tidyverse)
# does not attach, so we use an explicit assignment instead.)
phu_information.df <- phu_information.df %>%
  # Rename variables to lower case
  rename_with(str_to_lower) %>%
  # Replace variable name spaces with _
  rename_with(str_replace_all, pattern=r"(\s)", replacement="_") %>%
  # Rename the values in geographic_area
  mutate(geographic_area = str_remove_all(geographic_area,
    pattern=r"(^Public\sHealth\s|\sPublic.*|\sand\sDistrict\s.*|\sDistrict\s.*|\sHealth.*)"
  )) %>%
# Drop the "Recent" data
select(-2, -3) %>%
# Remove any variables containing the words "Completed" or "dose"
select(-matches("Completed|dose")) %>%
# Rename the "Cumulative" variables
rename_with(str_replace_all, pattern="cumulative_", replacement = "") %>%
# Rename "geographic_area" to "public_health_unit"
rename(public_health_unit = geographic_area)
head(phu_information.df)
str(phu_information.df)
## tibble [35 x 9] (S3: tbl_df/tbl/data.frame)
## $ public_health_unit : chr [1:35] "Ontario" "Algoma" "Brant County" "Chatham-Kent" ...
## $ case_count : num [1:35] 1107408 5275 9908 6924 48526 ...
## $ case_rate : num [1:35] 7516 4476 6452 6494 8342 ...
## $ hospitalizations_count: num [1:35] 42132 206 369 278 2252 ...
## $ hospitalizations_rate : num [1:35] 286 175 240 261 387 ...
## $ deaths_count : num [1:35] 12497 33 76 72 523 ...
## $ deaths_rate : num [1:35] 84.8 28 49.5 67.5 89.9 62.3 98.2 28.4 70.8 49.8 ...
## $ population : num [1:35] 14734014 117840 153558 106620 581722 ...
## $ eligible_population : num [1:35] 14010998 112839 144945 101565 552042 ...
# View the updated PHU list
unique(phu_information.df$public_health_unit)
## [1] "Ontario"
## [2] "Algoma"
## [3] "Brant County"
## [4] "Chatham-Kent"
## [5] "City of Hamilton"
## [6] "Durham Region"
## [7] "Eastern Ontario"
## [8] "Grey Bruce"
## [9] "Haldimand-Norfolk"
## [10] "Haliburton, Kawartha, Pine Ridge"
## [11] "Halton Region"
## [12] "Hastings Prince Edward"
## [13] "Huron Perth"
## [14] "Kingston, Frontenac and Lennox & Addington"
## [15] "Lambton"
## [16] "Leeds, Grenville & Lanark"
## [17] "Middlesex-London"
## [18] "Niagara Region"
## [19] "North Bay Parry Sound"
## [20] "Northwestern"
## [21] "Ottawa"
## [22] "Peel"
## [23] "Peterborough"
## [24] "Porcupine"
## [25] "Sudbury & Districts"
## [26] "Region of Waterloo"
## [27] "Renfrew County"
## [28] "Simcoe Muskoka"
## [29] "Southwestern"
## [30] "Thunder Bay"
## [31] "Timiskaming"
## [32] "Toronto"
## [33] "Wellington-Dufferin-Guelph"
## [34] "Windsor-Essex County"
## [35] "York Region"
We’re now ready to start visualizing our data, but how will we go about doing it?
Let’s take a look at the following chart from Dr. Andrew V. Abela.
From the flowchart above we’ll mainly explore looking at relationships and composition using our dataset as a basis for our visualizations. We’ll cover distributions and comparison with a more complex dataset later in this lecture.
Depending on the nature of your data and its dimensions there are a number of plots to choose from to represent and convey relationships between your variables. These various plots can reveal trends, correlations, and ultimately relationships between variables. For instance, as an initial form of graphical assessment, scatterplots build a framework for exploring our data further.
| Relationship Description | Types of charts |
|---|---|
| Finding correlations in your data | Scatterplot, Bubble chart, Lineplot, Heatmap |
| Showing connections between groups | Arc diagram, Chord diagram, Connection map, Network diagram, Non-ribbon chord diagram, Tree diagram |
| Showing relationships and connections between the data, or showing correlations between two or more variables | Heatmap, Marimekko chart, Parallel coordinates plot, Radar chart, Venn diagram |
In our first lecture, we covered the creation of lineplots. For today, we’ll focus on just two relational plots: the scatterplot and its variant, the bubble chart. Parallel coordinate plots will also make an appearance later on during this lecture.
geom_point() graphs

When we try to examine relationships, one direction we can take is to look for trends by comparing one variable against another. From an experimental standpoint, we can graph our independent variable on the x-axis, looking for changes in our dependent variable/measurement on the y-axis. This is most easily accomplished when both of your variables are on a continuous scale.
When trying to show correlations in your data between two to three variables you can use scatterplots. Remember there are some limitations when visualizing multi-dimensional data on a two-dimensional canvas. Overall these plots can convey to your audience the potential correlations between your variables or separations between groups based on their colour or shape.
Using the alpha parameter to help visualize overlapping data points

Let’s begin with a scatterplot comparing the number of cases versus hospitalizations. We’ll use the observations from each PHU as separate datapoints. Of note, within the geom_point() layer, we’ll be updating the alpha parameter to change the transparency of our points.
Using the “alpha” parameter: Note that
alpha = 1 will make completely opaque points while
alpha = 0 will make completely transparent points.
As points begin to overlap, they will create increasingly opaque areas
in your visualization.
# scatter plot of case_count vs hospitalizations_count
phu_information.df %>%
# Drop the first row of "Ontario" data
slice(2:n()) %>%
# 1. Data
ggplot() +
# 2. Aesthetics
aes(x=case_count, y = hospitalizations_count) +
# set text size
theme(text = element_text(size = 10)) +
# 4. Geoms
geom_point(alpha = 0.5)
The bubbleplot, as we’ll see, brings an additional dimension to your visualization by varying the size of your points based on a (usually) continuous variable. This can help to highlight underlying data trends in a more obvious fashion for your audience.
To achieve this we will set the size parameter in our aesthetics aes() layer. By setting the size parameter to a variable in our data, it will alter the size of our data points. We will augment the results of this by providing a size range through the scale_size() layer.
# bubble plot of case_count vs hospitalizations_count
phu_information.df %>%
# Drop the first row of "Ontario" data
slice(2:n()) %>%
# 1. Data
ggplot() +
# 2. Aesthetics
aes(x=case_count, y = hospitalizations_count,
size = population) + ## Set the size aesthetic
# set text size
theme(text = element_text(size = 10)) +
# 3. Scaling
## set the range for sizing our dots
scale_size(range = c(1,10), name = "Population (M)") +
# 4. Geoms
geom_point(alpha = 0.5)
From above we can see a lot of values are bunched up together at the lower end of our scales and likewise we have a few values that are far out on our x/y axes. To even this out, we can transform our axes to display values on a log10 scale. We have two places where we can accomplish this:
1. Within the aes() layer, by wrapping the variable in log(), e.g. log(case_count). Data will be placed on a log value scale but may have less meaning to the lay person.
2. With the scale_x_log10() and scale_y_log10() layers, which transform the axes while keeping the labels in the original units (this is the approach used below).

# bubble plot of case_count vs hospitalizations_count
phu_information.df %>%
# Drop the first row of "Ontario" data
slice(2:n()) %>%
# 1. Data
ggplot() +
# 2. Aesthetics
aes(x=case_count, y = hospitalizations_count, size = population) +
# set text size
theme(text = element_text(size = 10)) +
# 3. Scaling
scale_size(range = c(1, 10), name = "Population (M)") + # set the range for sizing our dots
## Set the x and y-axis to log scales
scale_x_log10() +
scale_y_log10() +
# 4. Geoms
geom_point(alpha = 0.5)
From our bubbleplot it is clear that rising case counts are tied to increasing population size. The relationship between our log-transformed data appears to be linear.
How do you interpret log-transformed data?: After log-transforming both axes it looks like the relationship between overall cases and hospitalizations is linear. While the transformation has definitely improved the visualization, it has made the interpretation a little harder. Overall, in this situation, you’ll either want to model against the untransformed data or do the math correctly and recognize that the relationship is a power function: a straight line log10(y) = m * log10(x) + b on log-log axes corresponds to y = 10^b * x^m on the original scales.
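To see the back-transformation in action, here is a quick base-R check on simulated data. The power-law parameters (2 and 1.5) are invented purely for illustration:

```r
# Simulate data following an exact power law: y = 2 * x^1.5
x <- seq(10, 1000, length.out = 50)
y <- 2 * x^1.5

# A linear fit on the log10-transformed values is a straight line,
# and its coefficients recover the power-law parameters
fit <- lm(log10(y) ~ log10(x))
b <- 10^coef(fit)[[1]]  # back-transformed intercept -> 2
m <- coef(fit)[[2]]     # slope -> 1.5, the exponent of the power function
```

A straight-line fit on log-log axes is therefore a power-function fit on the original axes, which is why the interpretation changes.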
Many students are familiar with the classic bar chart. Thus far, we’ve already used it to some extent to look at our last lecture’s PHU case data. When comparing groups or populations for differences in their distribution, however, the bar chart falls quite short of the ideal.
When comparing distributions, we again are often concerned with summary statistics - mean, median, standard deviation, confidence intervals. The nature of our data, however, is often not continuous across an entire y-axis interval as a bar chart may suggest but rather our population is the result of values centred, with some variance, around a mean value.
Over the remainder of the lecture we will compare and contrast 4 types of distribution plots for their relevance and when they may be most useful to visualize our data. When applicable to our data visualizations, we will also display all of our data points to better illustrate our distribution versus the visualization. We will examine:
Using geom_bar() or geom_col() to display data

As we have already seen, the geom_bar() layer can be used to display our data in multiple ways. The height of the bar is proportional to either:
1. the number of observations of each group or
2. the sum of weights applied by a weight aesthetic
In our Lecture 01 examples we actually used the
value of our observations to determine height and proportions of our
groups by setting the parameter stat="identity". This was
the application of a weighting based on the values of the y-axis
variable we chose.
A simpler way to accomplish this effect is with
geom_col(), which already uses stat_identity()
by default to calculate proportions.
| geom | Description | Requirements |
|---|---|---|
| geom_bar() | Counts observations in your data and (by default) determines height as a proportion of the total | Only accepts an x OR y parameter in aes() |
| geom_col() | Uses y-axis values as the height of each bar | Requires both x AND y parameters in aes() |
Both of these geoms use position="stack" by default, and the heights of the stacked segments match the observations or sums for multiple values sharing the same x position. Such instances can instead be displayed side by side using position="dodge" or position="dodge2". This is only helpful, however, when the number of x values (i.e., categories) is low; otherwise the graph becomes crowded.
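To see the stack/dodge distinction concretely, here is a small sketch with a toy data frame invented purely for illustration:

```r
library(ggplot2)

# Toy data: two groups observed in each of two categories (invented values)
toy <- data.frame(
  category = rep(c("A", "B"), each = 2),
  group    = rep(c("g1", "g2"), times = 2),
  value    = c(3, 5, 2, 4)
)

# Bars sharing an x position are stacked on top of one another by default
p_stack <- ggplot(toy) +
  aes(x = category, y = value, fill = group) +
  geom_col()  # position = "stack" is the default

# ...or placed side by side with position = "dodge"
p_dodge <- ggplot(toy) +
  aes(x = category, y = value, fill = group) +
  geom_col(position = "dodge")
```

In the stacked version each bar's total height is the category sum; in the dodged version each group keeps its own bar, which makes within-category comparisons easier.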
Let’s begin with a simple barplot of SARS-Cov-2-related deaths
accumulated over the course of the pandemic. Using
phu_information.df as a data source we will convey the
distribution of total deaths per PHU using the geom_col()
layer.
Recall the first row of our data is “Ontario”, representing the total
values for each variable as a sum of the other PHUs. We’ll remove that
from our data using slice() before passing it on to
ggplot().
# barplot of death counts across PHUs
phu_information.df %>%
# Remove the top row (Ontario data)
slice(2:n()) %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x = public_health_unit, y = deaths_count) +
theme(text = element_text(size = 20)) + # set text size
theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
# 4. Geoms
geom_col() ## Add our bars
Using reorder() to sort your data as you plot it

In the case of phu_information.df we can do a little better by ordering our x-axis of PHUs by total population or death counts. Since our data table is quite simple, with only a single observation for each PHU, there are a number of ways to accomplish a sort:
1. Convert population to a factor and use fct_reorder() to sort our factors.
2. Sort the data with arrange() before plotting.
3. Use the reorder() function while generating the plot.

Let’s sort our data by PHU population size in our next example using the R base function reorder() within our ggplot() call.
# barplot using the reorder function
phu_information.df %>%
# Remove the top row (Ontario data)
slice(2:n()) %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
### 3.1.1 We must choose a variable to sort by in the reorder function - use population
aes(x = reorder(public_health_unit, -population),
y = deaths_count) +
theme(text = element_text(size = 20)) + # set text size
theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
# 4. Geoms
geom_col()
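Outside of a full ggplot() call, the effect of reorder() is easy to see with base R alone. The unit names and population values below are toy data invented for illustration:

```r
# Toy vectors: three units and their populations (invented for illustration)
unit <- c("small", "large", "medium")
pop  <- c(100, 900, 500)

# reorder() returns a factor whose levels are sorted by the second argument;
# negating pop sorts in descending order, as in the plotting code above
asc  <- reorder(unit, pop)
desc <- reorder(unit, -pop)

levels(asc)   # "small"  "medium" "large"
levels(desc)  # "large"  "medium" "small"
```

When the reordered variable is mapped to x, ggplot2 draws the categories in level order, which is how a single reorder() call sorts the bars.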
From above, using the barplots we can quickly gauge the difference between PHUs as they get sorted by population. There’s not much need to colour by population as well, but it does add some emphasis to Toronto as a larger population and further reinforces the idea that we have sorted by population size.
What are we missing from this visualization that would help the reader? It would be nice to know just how much variation there is in population size, especially in the first ~12 PHUs. How much bigger is Peel versus the Region of Waterloo?
Using the fill parameter to distinguish between groups

One simple way to enhance our barplot is through the use of fill colour. By setting the fill parameter to population we can shade each bar based on the population size of each PHU. Let’s take a look.
# barplot using the reorder function
phu_information.df %>%
# Remove the top row (Ontario data)
slice(2:n()) %>%
# We must choose a variable to sort by in the reorder function - use population
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x = reorder(public_health_unit, -population), y = deaths_count, fill = population) +
theme(text = element_text(size = 20)) + # set text size
theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
### 3.2.1 Adjust the fill legends
guides(fill = guide_colorbar(title="Population\nsize")) +
# 4. Geoms
geom_col()
Layering geom_* layers to more accurately reflect values

As you can see from above, unless you have a very good eye for shades, there still isn’t much to help us distinguish between populations. We can see that Toronto has close to 3M residents, but otherwise we still have trouble discriminating between populations.

Instead of fill colour, we can add a little detail by layering an additional geom. In this case, we will use geom_point() to add the population of each PHU after scaling it down to 1/1000th.
Why scale population by 1/1000th? This is a decision based on what we already know about the data. Our y-axis deaths_count data scales from 0 to 4000, while populations scale from 0 to 3M, so it makes sense to try to get our values along the same scale.
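The arithmetic behind that choice can be sanity-checked directly; the ranges below are the approximate ones quoted above:

```r
# deaths_count spans roughly 0-4000; population spans roughly 0-3,000,000.
# Dividing population by 1000 brings both variables onto a comparable scale,
# so the layered points land within the range of the bars.
max_pop    <- 3e6
max_deaths <- 4000

max_pop / 1000                    # 3000, same order of magnitude as max_deaths
(max_pop / 1000) <= max_deaths    # TRUE: scaled points fit under the axis limit
```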
# barplot using the reorder function
phu_information.df %>%
# Remove the top row (Ontario data)
slice(2:n()) %>%
# We must choose a variable to sort by in the reorder function - use population
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x = reorder(public_health_unit, -population), y = deaths_count) +
theme(text = element_text(size = 20)) + # set text size
theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
guides(size = guide_legend(title="Population\nsize per\n1000"),
colour = "none") +
# 4. Geoms
geom_col() +
### 3.2.2 Add a geom_point to represent population size
geom_point(aes(y=..., size=2, colour="red"))
Now we have more clarity on populations sizes between PHUs, answering our original question that Peel region has nearly one million more inhabitants than the Region of Waterloo. To publish this figure we would want to fix a few additional things like the legend presentation, the axis names and their titles. We’ll drill into these ideas more next lecture!
On this same topic, let’s visit one last plot that can give us some pieces from the two variants of barplots we’ve used. The lollipop graph clarifies that x-axis values are more singular in nature rather than spanning a range, while still visually connecting those y-axis values to their x-axis categories.
It looks very much like it sounds, and we’ll add an extra twist to ours by setting the point size to the population size, giving it just a bit more informational dimension. To accomplish this visualization we’ll combine a geom_point() with a geom_segment().
Aesthetics when working with multiple geoms: As you’ll see in the following code, we’ve made some adjustments due to working with multiple geom_* layers. Since there can be overlapping aesthetics between geoms, you need to be cognizant of their effects. Rather than set these parameters in the aes() layer, set them directly in their respective geoms.
# lollipop plot, sorting the data by population before plotting
phu_information.df %>%
# Remove the top row (Ontario data)
slice(2:n()) %>%
# Arrange our data here to clean up the code
arrange(desc(population)) %>%
# Use the updated order to reorder the public_health_unit variable
mutate(public_health_unit = factor(public_health_unit, levels = .$public_health_unit)) %>%
ggplot(.) +
# 2. Aesthetics
aes(x = public_health_unit, y = deaths_count) +
theme(text = element_text(size = 20)) + # set text size
theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
guides(size = guide_legend(title="Population\nsize"),
colour = "none") + # Set color legend to none to remove it
# 3. Scaling
# set the range for sizing our dots
scale_size(range = c(1, 10), name = "Population (M)") +
# 4. Geoms
### 3.2.3 Make the stick of the lollipop
...(aes(x=public_health_unit, xend=...,
y=..., yend=...)) +
### 3.2.3 Put the candy on top
geom_point(aes(size = population), colour = "orchid")
Let’s revisit our public health unit data from last lecture. We’ll
use an updated long-format version that we created and saved in a csv
format. As you might recall this dataset contains a column of dates with
each row representing an observation of new_cases reported
by a public_health_unit on that date. We have a 4th column
total_phu_new representing total cases reported across all
PHUs on that specific date.
| date | public_health_unit | new_cases | total_phu_new |
|---|---|---|---|
| Date format: YYYY-MM-DD | factor of 34 PHUs | numeric:double | numeric:double |
| … | … | … | … |
In our first lecture we looked at another helpful visualization that barplots can produce: composition and proportions. Here we can combine the proportions of categories of data to convey an added dimension to our data. By stacking PHUs in our “new case” data we are now plotting new cases per month for multiple PHUs. The area of each stack, however, also gives us a sense of the proportions for each PHU that we’ve added.
# Open our data file and define the column types rather than let the function decide.
covid_phu.df <- read_tsv(...,
col_types=(...)) # Define column types here
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Take a peek at the data
head(covid_phu.df)
## Error in head(covid_phu.df): object 'covid_phu.df' not found
tail(covid_phu.df)
## Error in tail(covid_phu.df): object 'covid_phu.df' not found
lubridate package
Last week we breezed over the idea of working with the date format in R. It can be a rather esoteric subject, but it all comes down to how we as a society track dates. In R, a date is stored as an integer value representing the number of days since 1970-01-01. So yes, there must exist some “negative” dates.
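As a quick check of that epoch convention, we can coerce a few dates to integers ourselves:

```r
# Dates are stored as days since the Unix epoch (1970-01-01)
as.integer(as.Date("1970-01-01"))  # 0
as.integer(as.Date("1970-01-02"))  # 1
as.integer(as.Date("1969-12-31"))  # -1, one of those "negative" dates
```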
The lubridate package is here to help you simplify
working with dates and has a number of helpful functions that can be
used to extract information from your date. For us, we’ll use the
floor_date() function to help round our dates down to their
Year-Month format. This function takes just two
parameters:
date: your date or vector of dates to
convert
unit: the unit by which you want to round down -
this could be year, month, …, second (if your date is tracking that
specifically)
Note that our dates will still be in a date
object format!
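To see floor_date() in action before applying it to our dataset, here is a tiny standalone example:

```r
library(lubridate)

dates <- as.Date(c("2022-01-15", "2022-03-28"))
floor_date(dates, unit = "month")
# "2022-01-01" "2022-03-01" - both rounded down to the first of their month
```

The result is still a Date vector, so downstream layers like scale_x_date() continue to work.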
# What do our dates look like?
head(covid_phu.df$date)
## Error in head(covid_phu.df$date): object 'covid_phu.df' not found
# What do they look like after rounding down by month?
floor_date(covid_phu.df$date, unit = ...) %>% head()
## Error in head(.): '...' used in an incorrect context
scale_x_date() layer to set your x-axis information
Let’s use this to convert our date values for plotting as bar graphs!
We’ll convert our x-axis values in our aes() call. Our data
only goes out to January 25, 2023 so we’ll set our limits to accommodate
that point.
Note also, our scale_x_date() layer. We used this last
week to help set how our x-axis is displayed the data including
parameters like:
limits: where our x-axis begins and ends
date_breaks: how we would like our x-axis
ticks to break up
date_labels: how we want our x-axis ticks
to be labeled
expand: determine if we would like additional
padding on the sides of our x-axis. This is a 4-element vector but can
also accept 2 values c(mult, add) where you may either
expand outwards by multiplying limits by mult or you can
pad the limits with the unit value (days in this case) of
add.
Through some experimentation, we’ll set our limits to ~1/2 month before our start date and end dates. We’ll aim to display our data from December 2022 to January 2023 but remember that this data is floored to the first date of each month!
# Start with our PHU covid data
covid_phu.df %>%
# Filter to just look at a few PHUs
filter(public_health_unit %in% c("Toronto", "Peel", "York Region", "Ottawa")) %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x = ..., ## Floor the date
y= ...,
# set our fill colour instead of line colour
fill = fct_reorder(public_health_unit, new_cases, .desc=TRUE)) +
theme(text = element_text(size = 20)) + # set text size
guides(fill = guide_legend(title="Public Health Unit")) +
xlab("Date") + # Set the x-axis label
ylab("New cases") + # Set the y-axis label
ggtitle("New cases per day across all Ontario Public Health Units") +
# 3. Scaling
### 3.3.2 Scale our dates from mid-November to Mid-January.
scale_x_date(limits = c(as.Date("2022-11-15"), as.Date("2023-01-15")),
date_breaks = "1 month", date_labels = "%b-%Y",
expand = c(0,5)) + # Pad the x-axis by 5 days
scale_fill_viridis_d() + # the "d" stands for discrete colour scale
# 4. Geoms
geom_col() # Set up our barplot here
## Error in filter(., public_health_unit %in% c("Toronto", "Peel", "York Region", : object 'covid_phu.df' not found
coord_polar()
Generally speaking, there are a number of circular or polar coordinate plot variants, ranging in complexity from the pie chart to a racetrack chart. The coord_polar() layer takes a few parameters:
theta: determines if angles will be mapped to the x or y variable
start: offset from the 12 o’clock position
direction: 1 = clockwise, -1 = counter-clockwise
Here’s a summary:
| Chart name | Plot description | Uses | Cartesian analog |
|---|---|---|---|
| Pie chart | Sliced up proportions of a circle | Simple proportion comparison between groups | single bar variant |
| Nightingale/Coxcomb plot | Intervals are equal wedges but shaded areas represent proportions | Intervals with additional categorical proportions | Stacked column plot, theta=x |
| Race track plot | Intervals are split concentrically to form rings of equal width | Helpful if some groups are less diverse | Stacked column plot, theta=y |
From our pie chart description and from your own experience, you can recall that pie charts aren’t helpful with complex data. Once the number of categories surpasses 6-8 groups, things can get hard to visually digest - especially if one category dominates the others.
Let’s use phu_information.df to assess the cumulative
COVID-19 case data from across our PHUs to look at the top 4 PHUs by
caseload. We’ll convert a geom_bar() barplot to our pie
chart using the coord_polar() layer.
To accomplish this feat it helps to think about how the data is being handled. If you think about a pie chart, you can imagine it much like a stacked barplot where the top and bottom have been curved around to meet each other. This is a plot all about proportions and instead of having an x-axis defined, it is the proportions for a single category or group.
Therefore, we set the aes() parameter x to
a value of "" so there are no categorical values for
x but you will still need to set a y value
from which to determine those proportions.
# Go to the simple phu information table
phu_information.df %>%
# Sort for only a few PHUs
filter(public_health_unit %in% c("Toronto", "Peel", "York Region", "Ottawa")) %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
### 3.4.1 Set the x axis correctly
aes(x=..., y=...,
fill = ...) + # set our fill colour instead of line colour
theme_void() + # This will remove all background theme objects
theme(text = element_text(size = 20)) + # set text size
guides(fill = guide_legend(title="Public Health Unit")) +
xlab("") + # Set the x-axis label
ylab("New cases") + # Set the y-axis label
ggtitle("Total cases by health unit") +
# 3. Scaling
scale_fill_viridis_d() + # the "d" stands for discrete colour scale
# 4. Geoms
geom_bar(width=..., stat=..., colour = ...) + # Set up our barplot here
# 7 Coordinate system
### 3.4.1 Map angles to the Y-axis
...
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
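For reference, the barplot-to-pie conversion on synthetic data looks roughly like this (df and its columns are invented for illustration):

```r
library(ggplot2)

df <- data.frame(group = c("A", "B", "C"), cases = c(50, 30, 20))
ggplot(df, aes(x = "", y = cases, fill = group)) +
  # stat = "identity" uses the y values directly instead of counting rows
  geom_bar(width = 1, stat = "identity", colour = "white") +
  # Map the angle to y so the single stacked bar wraps into a pie
  coord_polar(theta = "y") +
  theme_void()
```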
Made famous by Florence Nightingale, these plots (which she called coxcomb plots) were used to emphasize the proportions of soldier deaths due to infection during the Crimean War. Each circular increment is equally spaced to represent increasing values along the y-axis. Visually, this gives more area representation as we radiate outwards on the chart, producing a chart that gives extra visual weight to its outermost values.
Florence Nightingale’s original rose chart/coxcomb plot publication. Sourced from the Wikimedia commons
Caution: outer segments disproportionately represent increases in value. Their larger area produces more visual emphasis despite the linear scale of the y-axis.
We’ll switch back to our PHU daily case data from
covid_phu.df to generate some stackable information.
Know your y-limits! Note that due to the disproportionate area increases with increasing y-axis values, you need to consider the range of your data. We’ve seen the high-values in our January 2022 data so we’ll set the x-axis scale again with scale_x_date() to range from February 2022 to January 2023.
# Start with our PHU covid data
covid_phu.df %>%
# Filter to just look at a few PHUs
filter(public_health_unit %in% c("Toronto", "Peel", "York Region", "Ottawa")) %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x = floor_date(date, unit = "month"), y= new_cases,
# set our fill colour instead of line colour
fill = fct_reorder(public_health_unit, new_cases, .desc=TRUE)) +
theme_bw() + # make the theme more plain
theme(text = element_text(size = 20)) + # set text size
theme(panel.grid.major.y = element_line(color="grey60")) + # darken our major y grid
guides(fill = guide_legend(title="Public Health Unit")) +
xlab("Date") + # Set the x-axis label
ylab("New cases") + # Set the y-axis label
ggtitle("New cases per day across all Ontario Public Health Units") +
# 3. Scaling
scale_x_date(limits = c(as.Date("2022-01-15"), as.Date("2023-01-15")),
# Scale the dates correctly to get a month-Year format
date_breaks = "1 month", date_labels = "%b-%Y", expand = c(0,0)) +
scale_fill_viridis_d() + # the "d" stands for discrete colour scale
# 4. Geoms
### 3.4.2 Set up our barplot here
... +
# 7 Coordinate system
### 3.4.2 Convert the barplot to a polar coordinate
...
## Error in filter(., public_health_unit %in% c("Toronto", "Peel", "York Region", : object 'covid_phu.df' not found
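The coxcomb idea, stripped to its essentials with made-up monthly data, is just a stacked geom_col() wrapped around with theta = "x":

```r
library(ggplot2)

df <- data.frame(month = rep(c("Jan", "Feb", "Mar"), each = 2),
                 group = rep(c("A", "B"), times = 3),
                 cases = c(4, 2, 6, 3, 5, 1))
ggplot(df, aes(x = month, y = cases, fill = group)) +
  geom_col() +              # stacked columns, one per month
  coord_polar(theta = "x")  # equal wedges per month; radius encodes counts
```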
You want to make another cool chart from a boring bar chart. Stacked
or otherwise, you can convert those bar plots to a racetrack or radial
bar chart. The only difference between the coxcomb plot and the radial bar
chart is that instead of the default theta="x" we set it to
"y". It's much like flipping the coordinates of a bar chart.
Beware the racetrack transformation: This is an aesthetic transformation in which each bar gets relatively longer as you move from the inside out, so values must be judged by angle rather than by bar length. Our eyes judge differences more precisely on a Cartesian bar chart, so don’t be misled by how these charts look at a casual glance!
Let’s make a set of barchart data and compare it to the racetrack
plot. Note that negative values in our dataset are plotted separately.
It is an implementation quirk of ggplot.
# Start with our PHU covid data
covid_phu.df %>%
# Filter to just look at a few PHUs
filter(public_health_unit %in% c("Toronto", "Peel", "York Region", "Ottawa")) %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x = floor_date(date, unit = "month"), y= new_cases,
# set our fill colour instead of line colour
fill = fct_reorder(public_health_unit, new_cases, .desc=TRUE)) +
theme(text = element_text(size = 20)) + # set text size
guides(fill = guide_legend(title="Public Health Unit")) +
xlab("Date") + # Set the x-axis label
ylab("New cases") + # Set the y-axis label
ggtitle("New cases per day across all Ontario Public Health Units") +
# 3. Scaling
scale_x_date(limits = c(as.Date("2022-01-15"), as.Date("2023-01-15")),
date_breaks = "1 month", date_labels = "%b-%Y", expand = c(0,5)) +
scale_fill_viridis_d() + # the "d" stands for discrete colour scale
# 3. Scaling
... + ### 3.4.3 Flip the x and y axes
# 4. Geoms
geom_col() # Set up our barplot here
## Error in filter(., public_health_unit %in% c("Toronto", "Peel", "York Region", : object 'covid_phu.df' not found
# Start with our PHU covid data
covid_phu.df %>%
# Filter to just look at a few PHUs
filter(public_health_unit %in% c("Toronto", "Peel", "York Region", "Ottawa")) %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x = floor_date(date, unit = "month"), y= new_cases,
# set our fill colour instead of line colour
fill = fct_reorder(public_health_unit, new_cases, .desc=TRUE)) +
theme_bw() + # make the theme more plain
theme(text = element_text(size = 20)) + # set text size
theme(panel.grid.major.y = element_line(color="grey60")) + # darken our major y grid
guides(fill = guide_legend(title="Public Health Unit")) +
xlab("Date") + # Set the x-axis label
ylab("New cases") + # Set the y-axis label
ggtitle("New cases per day across all Ontario Public Health Units") +
# 3. Scaling
scale_x_date(limits = c(as.Date("2022-01-15"), as.Date("2023-01-15")),
# Scale the dates correctly to get a month-Year format
date_breaks = "1 month", date_labels = "%b-%Y", expand = c(0,5)) +
scale_fill_viridis_d() + # the "d" stands for discrete colour scale
# 4. Geoms
geom_col() + # Set up our barplot here
# 7 Coordinate system
### 3.4.3 No need to flip the coordinates, just set theta=y
coord_polar(...)
## Error in filter(., public_health_unit %in% c("Toronto", "Peel", "York Region", : object 'covid_phu.df' not found
Often for our purposes as scientists, we are more concerned with the distribution of replicates or measurements as a population within a group while also comparing across other groups (i.e., control vs. treatment). Naively, we are tempted by the convenience of Excel to simplify our data as a bar chart with some simple standard deviation or error bars. One word concisely describes this behaviour:
Note from our examples that although we are comparing public health units, our datapoints represent single or total observations separated either by group or by time. At each x-axis point we are not comparing the distribution between categories in a meaningful way. In fact, from our first example, we can convey a near-exact message using a dot-plot! In essence these visualizations are communicating the categorization of samples across multiple groups. More or less, this is about visualizing proportions.
Section 3.0.0 Comprehension Question: After exploring the racetrack variant above, visualizing our data ranging from February 2022 to January 2023, what is the biggest issue you might see? What would be one way to remedy this? Is this an appropriate visualization of our data?
# Code your plot here!
What is the shape of our data? When we encounter experimental populations and begin sampling or measuring for specific characteristics, we inevitably begin to reveal whether or not that characteristic has a predictable range, median value, etc. Density plots, once given enough sample values, begin to predict the shape or distribution of that sampling. We sometimes use this to represent the theoretical distribution of our original populations.
In contrast to our previous datasets, we are interested in dissecting
group characteristics after sampling multiple times. Therefore, before
continuing on our journey, we need to find some data that is better
analysed as population data. Let’s open up some more SARS-CoV-2 data
from Ontario PHUs that has been broken into age and sex demographics:
Ontario_age_sex_COVID-19_cases.csv. We’ll see what we can
glean from the data.
Before we can do any visualization we’ll want to tidy up the data in the following steps: open the file with read_csv(), fix the column names, filter and recode the variables, and summarize within each PHU. After all of our transformations, we’ll have specific proportions of infections, deaths, and hospitalizations within each PHU for each age group.
# Open our dataset
covid_demographics.df <- read_csv(...)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Take a quick look at it
str(covid_demographics.df, give.attr = FALSE)
## Error in str(covid_demographics.df, give.attr = FALSE): object 'covid_demographics.df' not found
tail(covid_demographics.df)
## Error in tail(covid_demographics.df): object 'covid_demographics.df' not found
# Format our dataset to focus on total_cases and total_deaths
covid_demographics_total.df <-
covid_demographics.df %>%
#---------- Fix column names ----------#
# convert column names to lowercase and replace spaces with "_". Does this look familiar?
rename_with(., ~ str_to_lower(str_replace_all(string = .x,
pattern = r"(\s)",
replacement = "_"))) %>%
# clean up PHU names
mutate(geographic_area = str_remove_all(geographic_area,
pattern=r"((^Public\sHealth\s)|(\sPublic.*)|(\sHealth.*))"),
period = str_remove_all(string = period, pattern = r"(\sCOVID.*)")
) %>%
#---------- Filter and fix variables ----------#
# Filter out some of the excess data
filter(geographic_area != "Ontario", # Drop Ontario data
age_group != "All ages") %>% # Drop combined age data
# Pare down the dataset to the case, rate, hospitalization, and sex columns
select(period, from_date, to_date, geographic_area, age_group,
total_cases, total_rate,
total_hospitalizations_count, total_hospitalizations_rate,
male_cases, female_cases) %>%
# Convert the age_group into a factor
mutate(age_group = factor(age_group),
# after converting to factor, the "5 to 11" will be in the wrong position. Relevel age_group
age_group = fct_relevel(age_group, "5 to 11", after=1)) %>%
# Rename the geographic_area variable
rename(public_health_unit = geographic_area) %>%
#---------- Summarize data ----------#
# Group the data so you can manipulate based on PHU in the next steps
group_by(period, public_health_unit) %>%
# generate percent values for each age group within a PHU
mutate(percent_cases = ...,
# generate percent hospitalizations for each age group within a PHU
percent_hospitalizations = total_hospitalizations_count/sum(total_hospitalizations_count),
# generate percent male and female cases for each age group within a PHU
# We'll use this data later!
percent_male_cases = male_cases/(total_cases),
percent_female_cases = female_cases/(total_cases)
)
## Error in covid_demographics.df %>% rename_with(., ~str_to_lower(str_replace_all(string = .x, : '...' used in an incorrect context
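The grouped-percent pattern used above can be sketched on a toy table (the column names here are invented):

```r
library(dplyr)

df <- data.frame(phu = rep(c("A", "B"), each = 2),
                 age = rep(c("young", "old"), times = 2),
                 cases = c(30, 10, 5, 15))
df %>%
  group_by(phu) %>%
  # Each row's share of its own PHU's total cases
  mutate(percent_cases = cases / sum(cases))
# PHU A: 0.75 and 0.25; PHU B: 0.25 and 0.75
```

Because the data is grouped, sum(cases) is evaluated within each PHU rather than over the whole table.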
# Take a look at the different age demographics
levels(covid_demographics_total.df$age_group)
## Error in levels(covid_demographics_total.df$age_group): object 'covid_demographics_total.df' not found
str(covid_demographics_total.df, give.attr = FALSE)
## Error in str(covid_demographics_total.df, give.attr = FALSE): object 'covid_demographics_total.df' not found
While bar plots help to focus on proportional representation for categorical data, both density plots and histograms can be used to convey the frequency or distribution of values for a variable across its specified range. When comparing distributions visualized this way, you can usually compare up to 3 or 4 on the same plot before the data becomes too crowded. These methods also give you a sense of your data before moving forward with more detailed analyses. You can also identify possible mistakes or artefacts in data collection.
How much data do I need? Density plots can be thought of as smoothed versions of histograms which have been binned in small intervals. Density plots of a single dimension require a minimum of 4 samples but justifying a KDE on a sample size that small is hard. Histograms are recommended to have at least 30 observations to be considered useful and I would apply this rule of thumb to KDEs as well.
Let’s pick a few age groups to plot based on
percent_cases. We’ll use data gathered from each PHU to see
if the same general trends (if any) apply regardless of PHU.
covid_demographics_total.df %>%
# Filter for just a few age groups
filter(period == "cumulative",
age_group %in% c("0 to 4","12 to 19", "20 to 39", "80+")) %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x = ..., fill = ...) +
theme(text = element_text(size = 20)) + # set text size
# 4. Geoms
geom_density(...) + ## generate our KDE
...(aes(colour = age_group)) ## confirm our data values with a geom_rug
## Error in filter(., period == "cumulative", age_group %in% c("0 to 4", : object 'covid_demographics_total.df' not found
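A minimal, self-contained version of this KDE-plus-rug pattern on random data (names are illustrative):

```r
library(ggplot2)

set.seed(1)
df <- data.frame(value = c(rnorm(100, 0), rnorm(100, 2)),
                 group = rep(c("A", "B"), each = 100))
ggplot(df, aes(x = value, fill = group)) +
  geom_density(alpha = 0.5) +    # semi-transparent overlapping KDEs
  geom_rug(aes(colour = group))  # tick marks showing the raw values
```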
facet_*() to view KDEs separately
As you can see from above, as we start to have more and more age groups, it may be better to separate them out in order to avoid too much overlap. It may also be to our advantage to compare them in a more vertical fashion. Recall that there are two layers we can choose from: facet_wrap() and facet_grid(). They are differentiated by the following characteristics:
facet_wrap(): Data is split based on available data combinations of the facets parameter, which can be defined by a formula like ~var1+var2 or by using the quoting function vars(var1, var2). In either case, the data is then grouped by your variables. Panels are placed in a single ribbon that wraps around based on the arguments in the ncol and nrow parameters.
facet_grid(): Data is split based on the 1- or 2-dimensional combination of facet variables. Even combinations for which there is no data will be displayed as empty panels. Use the rows and cols parameters to set the facets by using the vars() quoting function. Note that you could further group your rows and/or columns by multiple variables.
Note that we’ll also use another parameter in both called scales, where we can determine if panels should share axis scales or change them based on individual panel data in the y-axis (free_y), x-axis (free_x), or both (free).
Let’s first use facet_wrap() to generate a single-column
facet of KDEs based on the cumulative period from our data.
covid_demographics_total.df %>%
# filter for only cumulative data
filter(period == "cumulative") %>%
# Select for just the important columns
select(public_health_unit, age_group, percent_cases) %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x = percent_cases, fill = age_group) +
theme(text = element_text(size = 20)) + # set text size
# 4. Geoms
geom_density(alpha = 0.5) +
# 6. Facets
### 4.1.1 Add a simple facet wrap, with each age group existing in its own row
facet_wrap(...,
scales=...,
ncol=...)
## Error in filter(., period == "cumulative"): object 'covid_demographics_total.df' not found
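A compact sketch of the two facet layers on invented data; the grouping variable g is illustrative:

```r
library(ggplot2)

set.seed(2)
df <- data.frame(g = rep(c("A", "B", "C"), each = 20),
                 x = rnorm(60))
# facet_wrap: one ribbon of panels, wrapped into a single column
ggplot(df, aes(x = x, fill = g)) +
  geom_density() +
  facet_wrap(~ g, ncol = 1, scales = "free_y")
# The facet_grid equivalent for rows would be facet_grid(rows = vars(g))
```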
facet_grid() to view both periods of our data
If, for example, we wanted to directly compare the data from both periods (recent and cumulative), we could alter the above plot so that our two periods are paneled beside each other. This can be accomplished with the facet_grid() layer, which will allow us to name both a rows and cols grouping parameter.
Let’s update our figure so that we split our rows by
age_group and our columns by period.
covid_demographics_total.df %>%
# Select for just the important columns
select(period, public_health_unit, age_group, percent_cases) %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x = percent_cases, fill = age_group) +
theme(text = element_text(size = 20)) + # set text size
# 4. Geoms
geom_density(alpha = 0.5) +
# 6. Facets
## Use a facet_grid to panel data by multiple variables
facet_grid(rows = ...,
cols = ...,
scales="free") # Allow scales to vary between rows and between columns
## Error in select(., period, public_health_unit, age_group, percent_cases): object 'covid_demographics_total.df' not found
While the above visualization is pretty clear, it certainly takes up a lot of extra space. Perhaps there is a better way to generate this kind of visualization?
Ridgeline plots (sometimes called Joyplots) can generate a compact way to show multiple distributions of numeric values across several groups. The distributions can be shown as either histograms or density plots with all of them aligned to the same horizontal axis. The vertical axis is compressed slightly to generate an overlap between categories.
To generate these visualizations we can use the ggridges
package which is an extension of ggplot2. In this case,
that means it uses the same grammar and can be added as a layer call to
geom_density_ridges(). A parameter to keep in mind:
scale: sets the vertical distance between
ridgelines
1: the tallest density curve just touches the baseline of the next vertical category
Above 1: increasing overlap
Below 1: increasing separation
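Assuming the ggridges package is installed, the basic call looks like this on synthetic data:

```r
library(ggplot2)
library(ggridges)  # extension package providing geom_density_ridges()

set.seed(3)
df <- data.frame(value = c(rnorm(100, 0), rnorm(100, 2)),
                 group = rep(c("A", "B"), each = 100))
ggplot(df, aes(x = value, y = group, fill = group)) +
  # scale > 1 lets each ridge slightly overlap the category above it
  geom_density_ridges(scale = 1.2)
```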
Let’s begin by replicating our faceted plot from above which compares the cumulative vs recent data across age groups.
covid_demographics_total.df %>%
# Select for just the important columns
select(period, public_health_unit, age_group, percent_cases) %>%
# Plot a ridgeline plot
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x = percent_cases, y = age_group, fill = age_group) +
theme_bw() + # Simplify the underlying theme
theme(text = element_text(size = 20)) + # set text size
theme(legend.position = "none") + # drop the legend since it's redundant
# 4. Geoms
### 4.2.0 Add a ridgeline plot instead of KDE
... +
# 6. Facets
### 4.2.0 Make panels to display our period data
facet_grid(..., scales = "free_x")
## Error in select(., period, public_health_unit, age_group, percent_cases): object 'covid_demographics_total.df' not found
geom_density_ridges_gradient() to fill densities on a gradient
From above you can now see that we’ve somewhat compacted all that dimensional data into a single plot that still clearly conveys the difference in overall proportions for total infections within each age group. The distributions across our categories suggest that the 20-39 age group makes up a larger proportion of overall cumulative cases within each PHU. On the other hand, the 20-80 age ranges all appear to have similar distributions of cases in the most recent data, although that may be a less reliable indicator of the true case spread.
For our audience, we would need to clean up our axis titles to clarify that these proportions are calculated independently. An additional option with your ridgeline plots is the fill variant. To accomplish a nicer gradient we will include a call to scale_fill_viridis_c since our x-axis is continuous. Keep in mind that we cannot set the alpha transparency on our density plots when filling with a gradient. We also have to set the fill aesthetic to stat(x) (after_stat(x) in newer versions of ggplot2) to accomplish this feat.
covid_demographics_total.df %>%
# Select for just the important columns
select(period, public_health_unit, age_group, percent_cases) %>%
# Plot a ridgeline plot
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x = percent_cases, y = age_group, fill = stat(x)) +
theme_bw() + # Simplify the underlying theme
theme(text = element_text(size = 20)) + # set text size
theme(legend.position = "none") + # drop the legend since it's redundant
# 3. Scaling
### 4.3.0 Use the Magma option
scale_fill_viridis_c(name="Percent of PHU", option = "C") +
# 4. Geoms
## Use a gradient version of ridges instead
... +
# 6. Facets
facet_grid(~period, # Equivalent to row = vars(period)
scales = "free_x")
## Error in select(., period, public_health_unit, age_group, percent_cases): object 'covid_demographics_total.df' not found
Did you know the boxplot is nearly 50 years old? First introduced in the 1970s by our favourite statistician, John Wilder Tukey, we’ll dig into how and when to use this iconic plot. While we’re here we’ll also take a look at other categorical distribution plots. While our KDE and ridgeline plots provide quite a bit of detail, they can be a little limited in their space efficiency. The following categorical distribution plots may offer more information per unit of space.
geom_boxplot()
As you can see from the previous section, we comfortably fit quite a few distributions on a ridgeline plot. From the looks of it, the 20-39 age group makes up a higher percent of cases across all the PHUs. Previously this data was broken down by 10-year groupings, but it has since been amended to use larger age groups in the 20+ range. Still, we can explore this data a little closer.
Can we visualize the data in a more summarized form? Let’s explore the box plot.
The dissection of a boxplot’s components shows us how it summarizes data distribution.
Also known as the box and whisker plot, this visualization conveys the distribution of samples within a group or population and is built upon 5 principal values:
“Box”
median: dark line across centre of box
lower quartile: lower-bound of box
upper quartile: upper-bound of box
“Whiskers”
lower whisker: extends to the smallest observation within 1.5 IQR below the lower quartile
upper whisker: extends to the largest observation within 1.5 IQR above the upper quartile
Together, the lower and upper quartiles produce the interquartile
range (IQR). The general implementation of boxplots classifies any
observation more than 1.5 IQR above the upper quartile or below the lower
quartile as an outlier of the distribution. The
characteristics of outliers can be set as parameters within the
geom_boxplot() layer. Parameters include
outlier.shape, outlier.size, and
outlier.colour.
Unlike a histogram, the minimum number of values needed to generate a boxplot is 5. While you could generate a boxplot with fewer values, you might not have actual whiskers! This is definitely a great alternative when sample sizes are between 5 and 30 for each population.
Historically this was a simple way to visualize the summary statistics of a population while being easy to produce by hand. Of course, with the age of computing, the production of kernel density estimates has allowed for more diverse visualizations. This plot, however, remains a popular format and thus is more readily understood by general audiences.
Each compact box can take up the same space as a barplot column but
it gives much more information about the population. Let’s look at a
single aspect - percent_cases in the cumulative
period.
# Generate a basic box plot with outliers present
covid_demographics_total.df %>%
# Filter for cumulative data
filter(period == "cumulative") %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x=..., y = ...) +
theme(text = element_text(size = 20)) + # set text size
# 4. Geoms
... ### 5.1.0 Add the boxplot geom
## Error in filter(., period == "cumulative"): object 'covid_demographics_total.df' not found
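For reference, a bare-bones geom_boxplot() with styled outliers on random data (df is invented):

```r
library(ggplot2)

set.seed(4)
df <- data.frame(group = rep(c("A", "B"), each = 30),
                 value = c(rnorm(30, 5), rnorm(30, 7)))
ggplot(df, aes(x = group, y = value)) +
  # Outlier points (beyond 1.5 IQR) drawn as open red circles
  geom_boxplot(outlier.colour = "red", outlier.shape = 1)
```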
From the looks of it, we can confirm what we saw before in our density plot - that the 20-39 age group generates the highest percentage of cases across PHUs and that with increasing age, the number of reported cases decreases as a proportion of the total.
We can add a few extra items to the plot to help us visualize the data:
add our data points with geom_jitter() and remove
outliers from the boxplot to avoid double-plotting points.
mark a ~95% confidence interval within the shape of each box plot
with the notch parameter.
you can set a variable width for each boxplot that is based on the number of samples used to generate the plot (the varwidth parameter).
# Generate a basic box plot with outliers present
covid_demographics_total.df %>%
# Filter for cumulative data
filter(period == "cumulative") %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x=age_group, y = percent_cases) +
theme(text = element_text(size = 20)) + # set text size
theme(legend.position = "none") +
# 4. Geoms
geom_boxplot(outlier.shape=..., ### 5.2.0 Remove outliers
notch = ...) + ### 5.2.0 Add a confidence interval notch
### 5.2.0 Add datapoints
...
## Error in filter(., period == "cumulative"): object 'covid_demographics_total.df' not found
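Combining the three additions above on synthetic data might look like this sketch:

```r
library(ggplot2)

set.seed(5)
df <- data.frame(group = rep(c("A", "B"), each = 40),
                 value = c(rnorm(40, 0), rnorm(40, 1)))
ggplot(df, aes(x = group, y = value)) +
  geom_boxplot(outlier.shape = NA,       # hide outliers to avoid double-plotting
               notch = TRUE,             # ~95% confidence interval notch
               varwidth = TRUE) +        # box width scales with sample size
  geom_jitter(width = 0.2, alpha = 0.5)  # raw points with horizontal jitter only
```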
geom_beeswarm takes plotting your points up to the next level
As you can see from above, the geom_jitter() layer does add points to our boxplot by plotting the points such that they avoid overlapping as much as possible. Points are restricted to the width of the boxplot, although this can also be adjusted to some degree with the right parameters. geom_jitter() is native to the ggplot2 package, with some parameters that allow for a more “random” distribution of your data points within a provided area.
The goal of the ggbeeswarm package is to generate points
that will not overlap but they can also be used to
simultaneously simulate the kernel density of your data. There are two
geoms supplied that work with the ggplot2 package to
accomplish this:
geom_beeswarm() has a number of parameters that can be
used to set their aes() mappings but also how the points
are laid out.
priority: determines the method used to perform the
point layout.
groupOnX: if TRUE then jitter is added to the x axis
(default behaviour) otherwise jitter along the y axis.
dodge.width: the amount by which different
aesthetics groups (must be a factor) will be dodged.
show.legend: determines if this layer should be included in the legends. Options are NA (include if aesthetics are mapped), FALSE (never include), and TRUE (always include).
cex: an optional parameter that can be used to help set the spacing between values.
# Generate a basic box plot with outliers present
covid_demographics_total.df %>%
# Filter for cumulative data
filter(period == "cumulative") %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x=age_group, y = percent_cases) +
theme(text = element_text(size = 20)) + # set text size
theme(legend.position = "none") +
# 4. Geoms
geom_boxplot(outlier.shape=NA, # Remove outliers
notch = TRUE) + # Add a confidence interval notch
### 5.3.0 Add datapoints as a beeswarm
...
## Error in filter(., period == "cumulative"): object 'covid_demographics_total.df' not found
geom_quasirandom() adds KDE structure to your plotting

As we can see from above, the geom_beeswarm() layer adds
a little more structure to the data in a somewhat rigid way. Any
datapoints that are near each other are deliberately spaced out to
almost represent the distribution of your data. Of course, you may run
into some issues as your number of datapoints increases or as your data
range increases (see the 20 to 39 age group).
To remedy this, we can balance the visualization a bit with the
geom_quasirandom() function.
geom_quasirandom() works similarly to the
geom_beeswarm() function with emphasis on an additional
method of how the points are plotted:
method: determines the method used to distribute the points. Options include "quasirandom" (the default), "pseudorandom", "smiley", "frowney", "maxout", "minout", "tukey", and "tukeyDense".
varwidth: if TRUE, vary the width of each group based on the number of samples in the group.
# Generate a basic box plot with outliers present
covid_demographics_total.df %>%
# Filter for cumulative data
filter(period == "cumulative") %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x=age_group, y = percent_cases) +
theme(text = element_text(size = 20)) + # set text size
theme(legend.position = "none") +
# 4. Geoms
geom_boxplot(outlier.shape=NA, # Remove outliers
notch = TRUE) + # Add a confidence interval notch
### 5.3.1 Add datapoints as a quasirandom beeswarm
...
## Error in filter(., period == "cumulative"): object 'covid_demographics_total.df' not found
Now our data points take a more nuanced shape, with a uniform width whose outline traces the distribution of the data. We'll see this with even more emphasis in a few sections.
When you need to show the distribution of your data, prefer the quasirandom placement of points over a simple beeswarm.
fill

Is there more to the data we've visualized? We can add a third dimension in a number of ways, but the simplest would be to compare the proportions of total cases vs total hospitalizations over the course of this pandemic. To do so, we can pivot our dataset to collapse percent_cases and percent_hospitalizations together into a single variable.
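As a minimal sketch of that pivoting step - using a tiny invented data frame rather than the real covid_demographics_total.df - pivot_longer() collapses the two percentage columns into a stat_group/value pair like so:

```r
library(tidyr)

# Invented toy values standing in for the course dataset
wide.df <- data.frame(
  age_group                = c("0 to 4", "5 to 11"),
  percent_cases            = c(0.05, 0.08),
  percent_hospitalizations = c(0.01, 0.02)
)

long.df <- pivot_longer(wide.df,
                        cols      = c(percent_cases, percent_hospitalizations),
                        names_to  = "stat_group",
                        values_to = "percent_PHU_total")
# long.df now holds one row per age_group/stat_group pair,
# so stat_group can be mapped to the fill aesthetic
```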
From there we’ll use the fill parameter to generate
different subgroups in our boxplot to make a grouped boxplot. In doing
so, we’ll also have to add the use of the
geom_quasirandom() parameters:
dodge.width: separate the data points by any
aesthetic groups that have been assigned.
width: set the maximum spread of your data points within each grouping.
alpha: set the opacity of your data points so you
can see more of the overlapping data.
Lastly, we’ll facet the plot by period so we can, yet again, compare the cumulative data vs a more recent snapshot of the data.
covid_demographics_total.df %>%
# Select for just the important columns in our analysis
select(period, public_health_unit, age_group, percent_cases, percent_hospitalizations) %>%
# Pivot the modified table to capture the "stat_group" of percent_cases vs percent_hospitalizations
pivot_longer(cols=c(4:5), names_to = "stat_group", values_to = "percent_PHU_total") %>%
# Plot the data as a grouped boxplot
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x=age_group, y = percent_PHU_total,
fill = ...) + ### 5.4.0 Alter our fill aesthetic
theme(text = element_text(size = 20)) + # set text size
# 4. Geoms
geom_boxplot(outlier.shape=NA, notch = TRUE) + # Add the boxplot geom
# Add a quasirandom distribution of our data points
geom_quasirandom(dodge.width = ..., width = ..., alpha = 0.5) +
# 6. Facet
### 5.4.0 Facet our data based on period
facet_wrap(~period, nrow = 2, scales = "free_y")
## Error in select(., period, public_health_unit, age_group, percent_cases, : object 'covid_demographics_total.df' not found
In our boxplots, we are plotting the data in both a boxplot and a dotplot format. The shape of the dots helps to give a better sense of the PHU distribution within each age group. You can see that the data points overlay the box but also fall outside of it. Can we get the compact nature of the boxplot while still getting the visual appeal of a density plot?
As the title says, the violin plot is a mixture of both the boxplot
and kernel density plot. It’s a miniaturized KDE that is mirrored across
its axis. It encompasses the summary information of the boxplot but in a
pear or violin-shaped distribution. To generate a
geom_violin() in ggplot, a minimum of three
values are required. To justify using a violin, I would again suggest
sticking to a similar rule of thumb of a minimum ~30
samples/observations to ensure an accurate representation of the
distribution.
The nuanced visualization of a violin plot conveys much more information than the box plot itself, and most boxplots can be replaced with a violin plot. Despite being a gateway to more detailed distribution information, this format remains less popular and less familiar to scientists, so its immediate accessibility to your audience can be limited.
An important parameter of this geom is scale: this sets
how big each violin is in comparison to each other. It accepts the
following values:
area: default value so all violins will have the
same area.
count: scale areas proportionately to
observations.
width: all violins have the same maximum width.
Total area is not the same for each.
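A quick way to see the difference between these values is to try each one on a dataset that ships with R (iris here is just a stand-in for our data):

```r
library(ggplot2)

# iris ships with base R, so this runs without the course dataset
p <- ggplot(iris, aes(x = Species, y = Sepal.Length))

p + geom_violin(scale = "area")   # default: all violins have the same area
p + geom_violin(scale = "count")  # areas proportional to n (identical here: 50 per species)
p + geom_violin(scale = "width")  # all violins share the same maximum width
```

With equal group sizes, "area" and "count" look the same; the differences appear once groups have unequal numbers of observations.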
Outliers and violins: Much like a KDE, the theoretical distribution of a violin plot can generate some impossible values - especially at the tails. Remember that this is a theoretical distribution based on the data supplied. Depending on how much variation is in your data, and how many outliers it has, it can really affect the shape of your violin.
covid_demographics_total.df %>%
# Filter for only cumulative data
filter(period == "cumulative") %>%
# Select for just the important columns in our analysis
select(period, public_health_unit, age_group, percent_cases, percent_hospitalizations) %>%
# Pivot the modified table to capture the "stat_group" of percent_cases vs percent_hospitalizations
pivot_longer(cols=c(4:5), names_to = "stat_group", values_to = "percent_PHU_total") %>%
# Plot the data as a grouped boxplot
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x=age_group, y = percent_PHU_total, fill = stat_group) +
theme(text = element_text(size = 20)) + # set text size
# 3. Scaling
scale_y_continuous(limits = c(0, 0.6)) + # Set the y-axis limit
# 4. Geoms
... + ## Add the violin geom
# Add a quasirandom distribution of our data points
geom_quasirandom(dodge.width = 0.85)
## Error in filter(., period == "cumulative"): object 'covid_demographics_total.df' not found
The major advantage of the violin plot is that, by its nature, it is very sensitive to the distribution that produces the density estimate. The boxplot represents the summary information of a distribution but always draws the same simple, unimodal picture - there are not enough parameters supplied to represent anything more complex!
The violin plot is not limited in that respect. Despite some of its visual caveats, it can certainly detect multi-modal data. Let's make a toy example to illustrate.
covid_demographics_total.df %>%
# Filter for only cumulative data
filter(period == "cumulative") %>%
# Select for just the important columns in our analysis
select(period, public_health_unit, age_group, percent_cases, percent_hospitalizations) %>%
# Pivot the modified table to capture the "stat_group" of percent_cases vs percent_hospitalizations
pivot_longer(cols=c(4:5), names_to = "stat_group", values_to = "percent_PHU_total") %>%
# Plot a combined violin and boxplot to show difference in distributions
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x = percent_PHU_total, y = age_group) + ### 5.6.0 Swap the x and y axes
theme_bw()+
theme(text = element_text(size = 20)) + # set text size
# 4. Geoms
...(scale = "width", colour="darkviolet") +
geom_boxplot(alpha = 0.6) +
geom_quasirandom(aes(colour = ...))
## Error in filter(., period == "cumulative"): object 'covid_demographics_total.df' not found
From our above example, you can see that we blended a number of
geoms together. With a little working around, we can also
plot both violins and boxplots together in a multivariate setting! This
gives us the familiarity of the boxplot but also clearly displays the
theoretical distribution. Some steps to accomplish this:
1. Set the scale parameter to "width".
2. Adjust the aes() parameters for our geoms to ensure our points are plotted correctly.

covid_demographics_total.df %>%
# Filter for only cumulative data
filter(period == "cumulative") %>%
# Select for just the important columns in our analysis
select(period, public_health_unit, age_group, percent_cases, percent_hospitalizations) %>%
# Pivot the modified table to capture the "stat_group" of percent_cases vs percent_hospitalizations
pivot_longer(cols=c(4:5), names_to = "stat_group", values_to = "percent_PHU_total") %>%
# Plot the data as a grouped boxplot
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x=age_group, y = percent_PHU_total) +
theme(text = element_text(size = 20)) + # set text size
# 3. Scaling
scale_colour_manual(values=c("black", "black"))+ # we'll need this to fix our boxplots
# 4. Data
# multi-factor violin plots but keep the width consistent
geom_violin(scale="width", aes(fill=...)) +
# Boxplot but smaller width so they reside "within" the violin plot
geom_boxplot(aes(colour = ...), width=0.2,
position = ...,
outlier.shape=NA) + # Remove the outliers
### 5.7.0 Add in all of the data points
geom_quasirandom(dodge.width = 0.85, aes(group=stat_group), alpha = 0.8)
## Error in filter(., period == "cumulative"): object 'covid_demographics_total.df' not found
Section 5.7.0 Comprehension Question: Looking at the above code for our graph, why do you think we set the aes(fill/group = stat_group) aesthetic individually for some layers rather than directly in the aes() layer?
While the above visualization shows the total distribution of our age groups with some clarity, the messiness of outliers has crept into some of the violin plots themselves. An experienced eye, however, can still see that the bulk of the population centres around our internal boxplots, offset by the stretched-out look of our violins. We could, of course, clean it up by removing some of the smaller PHU populations or by removing outliers ahead of time.
Suppose, however, we wanted to add more levels to our data like
percent_male_cases and percent_female_cases?
With 7 age groups represented, things would start to get very crowded.
To accommodate all of that data, you could facet it into 4 groups based
on the statistic used but this would separate the data, which looks
sharper when you can see it all on a single plot.
Organizing multiple groups, across multiple categories is the domain
of the parallel coordinate plot. The GGally package is a
ggplot2 extension with the ggparcoord function
which allows us to look simultaneously at our 3 indicators (total cases,
deaths, and hospitalizations) for each age group, linking each PHU. For
each PHU at each indicator, we will draw a line connecting all values
across the age groups (coordinates). We will colour our lines based on
the category of the indicator.
Although GGally is an extension for
ggplot2, we actually have to wrestle with our data a little
bit more and put it back into a wider format. Otherwise we can treat the
plot similarly after making it by changing themes etc.
The parameters for ggparcoord() require:
columns: a numeric vector of the location of columns
that will represent the coordinates
(categories/groups/variables).
groupColumn: the column that defines the indicator type (stat_group) for the observations.
scale: each coordinate should theoretically have its own scale, but that isn't always possible depending on the data.
Instead, the default method of scaling is to normalize points in terms
of the standard deviation of the data along that coordinate. You can,
however, choose to just use the original scale with
globalminmax if the ranges of columns aren’t too
disparate.
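Before tackling our own data, here is a minimal ggparcoord() sketch on the built-in iris dataset (a stand-in for our wrangled wide-format table):

```r
library(GGally)

# iris: columns 1-4 are the measurements (our "coordinates"),
# column 5 (Species) plays the role of the grouping indicator
ggparcoord(iris,
           columns     = 1:4,    # which columns become the coordinates
           groupColumn = 5,      # colour each line by its group
           showPoints  = TRUE,   # draw a point at each coordinate
           scale       = "std")  # default: centre and scale by standard deviation
```

Each line is one observation traced across all four coordinates, exactly the structure we want for PHUs across age groups.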
To generate this visualization, we’ll want to:
1. Remake our list of PHUs ordered by descending case load.
2. Pivot our data to wide-format, giving each age group its own column in our data
# Generate the ordered list of PHUs by total cases
phu_by_cases <- phu_information.df %>%
slice(2:n()) %>% # drop the Ontario data
select(1, 2) %>% # Only take geographic_area and case_count
arrange(desc(case_count)) %>% # sort by descending order
select(1) %>% # grab just the PHU names
unlist() %>% # Unlist
as.character() # Convert to a character vector
# Look at the ordered list of PHUs
phu_by_cases
## [1] "Toronto"
## [2] "Peel"
## [3] "York Region"
## [4] "Ottawa"
## [5] "Durham Region"
## [6] "City of Hamilton"
## [7] "Halton Region"
## [8] "Region of Waterloo"
## [9] "Windsor-Essex County"
## [10] "Simcoe Muskoka"
## [11] "Niagara Region"
## [12] "Middlesex-London"
## [13] "Wellington-Dufferin-Guelph"
## [14] "Eastern Ontario"
## [15] "Sudbury & Districts"
## [16] "Southwestern"
## [17] "Kingston, Frontenac and Lennox & Addington"
## [18] "Brant County"
## [19] "Lambton"
## [20] "Thunder Bay"
## [21] "Haliburton, Kawartha, Pine Ridge"
## [22] "Haldimand-Norfolk"
## [23] "Hastings Prince Edward"
## [24] "Chatham-Kent"
## [25] "Leeds, Grenville & Lanark"
## [26] "Grey Bruce"
## [27] "Huron Perth"
## [28] "Peterborough"
## [29] "Porcupine"
## [30] "Algoma"
## [31] "Northwestern"
## [32] "North Bay Parry Sound"
## [33] "Renfrew County"
## [34] "Timiskaming"
covid_pcp.df <-
#head(covid_demographics_total.df)
covid_demographics_total.df %>%
# Filter for the top 15 PHUs by case load
filter(public_health_unit %in% phu_by_cases[1:15],
period == "cumulative"
) %>%
# Select for just the important columns
select(period, public_health_unit, age_group, percent_cases, percent_hospitalizations,
percent_male_cases, percent_female_cases) %>%
# Pivot the modified table to capture the "stat_group" of percent_cases
# percent_hospitalizations, and percent_x_cases
pivot_longer(cols=c(4:7), names_to = "stat_group", values_to = "percent_PHU_total") %>%
# We're going to reorder the factor ahead of time
# Probably a better way to do this but requiring more code
mutate(stat_group = factor(stat_group, levels=c("percent_hospitalizations", "percent_cases",
"percent_male_cases", "percent_female_cases"))) %>%
# Now we're going to pivot wider to give each age_group its own column
pivot_wider(names_from = age_group, values_from = c(percent_PHU_total)) %>%
# Fix the positioning of our factor levels
relocate('5 to 11', .after = '0 to 4')
## Error in filter(., public_health_unit %in% phu_by_cases[1:15], period == : object 'covid_demographics_total.df' not found
head(covid_pcp.df)
## Error in head(covid_pcp.df): object 'covid_pcp.df' not found
# Generate a ggplot object
# 1. Data and Geoms
ggparcoord(...,
columns=...,
groupColumn=...,
showPoints = ...,
scale=...) +
# 2. Aesthetics
theme_bw() +
theme(text = element_text(size = 20)) + # set text size
xlab("Age group") +
ylab("Percent of category by individual PHU") +
guides(colour = guide_legend(title="Category"))
## Error in ggparcoord(..., columns = ..., groupColumn = ..., showPoints = ..., : could not find function "ggparcoord"
group your data by multiple variables with the interaction() function

Sometimes you may want specific parts of your data to be grouped together, even if you aren't planning on using that information directly to determine a specific aesthetic category like size, colour, or fill.
For example, our parallel coordinate plot from above could be reproduced directly in ggplot. As long as we have our data in the proper long format, we can use specific variables to ensure our data is displayed properly by using the group aesthetic.
In the case of our data, we’d like to be sure we can group our data
lines by following their age_group and data categories (ie
percent_cases, percent_hospitalizations, etc.). In our wrangled
dataframe this will fall under the stat_group variable
after pivoting longer.
The problem with this strategy, however, is that group
does not normally allow for more than one variable - like most
aesthetics. We can solve this with the interaction()
function which will allow us to name multiple factors that will
be used to formally categorize observations as a composite of those
factors.
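A quick sketch of what interaction() produces, using two tiny invented factors (the values here are made up for illustration):

```r
# interaction() pastes factor levels together into one composite factor
age  <- factor(c("0 to 4", "0 to 4", "5 to 11", "5 to 11"))
stat <- factor(c("cases", "hosp", "cases", "hosp"))

interaction(age, stat)
# The composite levels look like "0 to 4.cases", "5 to 11.hosp", etc.,
# so each age_group/stat_group combination becomes its own group.
# Inside a plot this would be used as:
#   aes(group = interaction(age_group, stat_group))
```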
With that in mind, let’s mock up a quick parallel coordinate plot of our own!
covid_demographics_total.df %>%
# Filter for only cumulative data and reduce the PHU data to 15 groups
filter(public_health_unit %in% phu_by_cases[1:15],
period == "cumulative"
) %>%
# Select for just the important columns in our analysis
select(period, public_health_unit, age_group, percent_cases, percent_hospitalizations,
percent_male_cases, percent_female_cases) %>%
# Pivot the modified table to capture the "stat_group" of percent_cases vs percent_hospitalizations
pivot_longer(cols=c(4:7), names_to = "stat_group", values_to = "percent_PHU_total") %>%
ggplot() +
# 2. Aesthetics
aes(x = age_group, y = percent_PHU_total) +
# Set a basic theme
theme_bw()+
theme(text = element_text(size = 20)) + # set text size
# 4. Geoms
### 5.8.1 A line plot to join the points
geom_line(aes(group = ...,
colour = stat_group)) +
### 5.8.1 Add the points in also to help follow the connections
geom_point(aes(colour = stat_group))
## Error in filter(., public_health_unit %in% phu_by_cases[1:15], period == : object 'covid_demographics_total.df' not found
Looks nearly the same AND you don’t have to remember how to properly
format your data for using ggparcoord!
From the data above, we see some very interesting points in our 60 to 79 and 80+ categories.
Remember all of this data represents the proportion or share of each age group for each particular indicator. If, however, we really want to understand which age group has the worst hospitalization or case rate, we need to drill down into each age group.
Let’s generate one last visualization and look at the probability of hospitalization after contracting SARS-CoV-2. We’ll calculate those values on the cumulative data and then graph them.
Note: We are working with cumulative values across the entire pandemic - this doesn't give us a sense of external factors such as vaccination rates or differences between variants!
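The calculation itself is just simple division. Here is a hedged sketch with invented counts (the column names mirror our dataset, but the numbers are made up):

```r
library(dplyr)

# Hypothetical counts, NOT from the real Public Health Ontario data
toy.df <- data.frame(
  age_group                    = c("60 to 79", "80+"),
  total_cases                  = c(1000, 400),
  total_hospitalizations_count = c(80, 90)
)

toy.df %>%
  mutate(prob_hospitalization = total_hospitalizations_count / total_cases)
# 80/1000 = 0.08 and 90/400 = 0.225:
# the older group is hospitalized at a much higher rate per case
```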
# Start with our covid demographics data
covid_demographics_total.df %>%
# Ungroup the data
ungroup() %>%
# Filter for cumulative data
filter(period == "cumulative") %>%
# Just grab the columns we want to use
select(period, public_health_unit, age_group, total_cases, total_hospitalizations_count) %>%
# Generate our new calculations and save to our data frame
mutate(prob_hospitalization = ...) %>%
# Drop any potential NA data due to division by 0
filter(complete.cases(.)) %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x = ..., y=...) +
theme(text = element_text(size = 20)) + # set text size
theme(legend.position = "none") +
xlab("age group") +
ylab("probability of hospitalization") +
ggtitle("Probability of hospitalization after contracting SARS-CoV-2 from multiple PHUs across age groups") +
# 4. Geoms
# Add a boxplot
geom_boxplot(varwidth=TRUE, outlier.shape=...) +
# Add our data points
geom_quasirandom(aes(colour = age_group), size = 3, alpha = 0.7)
## Error in filter(., complete.cases(.)): '...' used in an incorrect context
We’ve covered a number of key plots today, including when and how to use them. Next week we’ll revisit some of these plots and spruce them up with extra touches that will take them that extra distance. Below you’ll find a summary of what we’ve discussed today.
| Plot | Key Features | Notes |
|---|---|---|
| Scatterplot | Good for exploring relationships between variables | Bubbleplots add an extra dimension to your data |
| Barplot | Present values across groups.<br>Stack categories for extra dimension | Presenting proportions, small sample sizes.<br>Does not dissect individual distributions |
| Nightingale plot | Circular-wedge barplot, same properties as barplot | Presenting data over unordered groups.<br>Visual emphasis on outer area size may mislead reader |
| Racetrack plot | Circular-ringed barplot, same properties as barplot | Looking for a more compact way to show barplots.<br>Calculate length by radians as outer rings are "stretched" |
| Density plot | Theoretical distribution of your sample data.<br>Can plot up to 5 distributions on same axis | Minimum sample size 4 but 30 is more reliable.<br>Tails can produce "ghost" data |
| Ridgeplots | Allows tighter visualizations of multiple densities.<br>Same properties as KDEs | Good way to pack more KDEs into a smaller area.<br>No real control of outliers.<br>Similar "ghost" data issues |
| Box and whisker plot | Summarize distributions with 5 parameters | Popular and compact presentation of simple populations.<br>Minimum sample size = 5.<br>Does not properly visualize multi-modal data |
| Violin plots | Boxplot format with KDE violin shape | Compact representation of distribution shape.<br>Less popular with nuanced interpretation.<br>Inherits "ghost" data and other properties of KDE.<br>DOES interpret multi-modal populations |
| Parallel coordinate plot | Visual representation of multivariate data.<br>Related data can be connected linearly | Connects trends across groups.<br>Not limited by number of samples.<br>Can help identify trends within multicategorical data |
This week’s assignment will be found under the current lecture folder under the “assignment” subfolder. It will include an R markdown notebook that you will use to produce the code and answers for this week’s assignment. Please provide answers in markdown or code cells that immediately follow each question section.
| Assignment breakdown | | |
|---|---|---|
| Code | 50% | - Does it follow best practices?<br>- Does it make good use of available packages?<br>- Was data prepared properly? |
| Answers and Output | 50% | - Is output based on the correct dataset?<br>- Are groupings appropriate?<br>- Are titles/axes/legends correct?<br>- Is interpretation of the graphs correct? |
Since coding styles and solutions can differ, students are encouraged to use best practices. Well-coded or elegant solutions may be rewarded.
You can save and download the markdown notebook in its native format. Submit this file to the appropriate assignment section by 12:59 pm on the date of our next class: March 28th, 2024.
Revision 1.0.0: created and prepared for CSB1021H S LEC0141, 03-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.0.1: edited and prepared for CSB1020H S LEC0141, 03-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.0.2: edited and prepared for CSB1020H S LEC0141, 03-2023 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 2.0.0: Revised and prepared for CSB1020H S LEC0141, 03-2024 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Public Health Ontario data: https://www.publichealthontario.ca/en/data-and-analysis/using-data/open-data
Anscombe’s Quartet: https://en.wikipedia.org/wiki/Anscombe%27s_quartet
Choosing the right chart type: https://extremepresentation.typepad.com/blog/2006/09/choosing_a_good.html
A layered grammar of graphics by Hadley Wickham: http://vita.had.co.nz/papers/layered-grammar.html
read_delim() column types and information: https://readr.tidyverse.org/reference/read_delim.html
Pie chart rules: https://www.1ka.si/d/en/help/faq/what-is-the-maximum-number-of-categories-to-display-using-pie-charts
Kernel density estimation: https://stats.stackexchange.com/questions/76948/what-is-the-minimum-number-of-data-points-required-for-kernel-density-estimation
Ridgeline plots: https://www.data-to-viz.com/graph/ridgeline.html
Visualizing samples with boxplots: https://www.nature.com/articles/nmeth.2813.pdf?origin=ppub#:~:text=Whereas%20histograms%20require%20a%20sample,render%20it%20even%20more%20informative.
Reasons to use a violin plot: https://blog.bioturing.com/2018/05/16/5-reasons-you-should-use-a-violin-graph/
Parallel coordinate plot parameters: https://www.rdocumentation.org/packages/GGally/versions/1.5.0/topics/ggparcoord
Lots of more plots in R!: https://www.r-graph-gallery.com/index.html
The Centre for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto offers comprehensive experimental design, research, and analysis services in microbiome and metagenomic studies, genomics, proteomics, and bioinformatics.
From targeted DNA amplicon sequencing to transcriptomes, whole genomes, and metagenomes, from protein identification to post-translational modification, CAGEF has the tools and knowledge to support your research. Our state-of-the-art facility and experienced research staff provide a broad range of services, including both standard analyses and techniques developed by our team. In particular, we have special expertise in microbial, plant, and environmental systems.
For more information about us and the services we offer, please visit https://www.cagef.utoronto.ca/.